File connector

In earlier versions of Hyperscale Compliance, Delimited and Parquet connectors were released as separate connectors. From the 23.0.0 version of Hyperscale Compliance onwards, Delimited and Parquet connectors are merged into a single file connector.
The connector can be used to mask large delimited and parquet files. The file connector’s unload service splits the large files into smaller chunks and passes them onto the masking service. After the masking is completed, the files are sent to the load service which joins back the split files (the end user also has a choice to disable the join operation).
Delphix supports Delimted and Parquet as file types for file connectors. The splitting and joining of files are handled by the writers embedded in the file connector. You can choose an appropriate writer for your operations by setting an environment variable DATA_WRITER_TYPE. Users can also skip the processing of source files in unload service by enabling the staging push feature, setting an environment variable SKIP_UNLOAD_WRITERS to true will enable the staging push at the unload service. Users can also facilitate staging push in load service by setting up the environment variable to SKIP_LOAD_WRITERS.
The supported data writers are:

pyarrow
pyspark
cat

You can refer to the below matrix when choosing the writer type.

Source type	File type	DATA_WRITER_TYPE (Unload)	DATA_WRITER_TYPE (Load)
FS	Delimited/Parquet	pyarrow(default)/pyspark	pyarrow/cat (default)
AWS	Delimited/Parquet	pyarrow(default)/pyspark	pyspark(default)
HADOOP-DB/HADOOP-FS	Delimited/Parquet	pyarrow*	pyarrow*
HADOOP-DB/HADOOP-FS	Delimited/Parquet	pyspark	pyspark
S3	Delimited/Parquet	pyspark	pyspark

*If the source is Hadoop and the user is willing to use pyarrow as a writer, they must provide the Hadoop client as a volume mount.

Masking Parquet Files:When working with Parquet files, you must use PySpark as the writer type. Set "writer_type": "pyspark" in both the source and target configurations. Refer to the example below.

Legacy Parquet Masking:In versions prior to 2025.6.0, Parquet files were first converted to CSV before masking, as masking support for Parquet was not available. To enable this legacy behavior, set "skip_parquet_to_csv_conversion": false in source_config when creating the job. With this option, you may use either PyArrow or PySpark as the writer type.

PySpark writer type example:

Copy

"source_configs": {
    "writer_type": "pyspark"
},
"target_configs": {
    "writer_type": "pyspark"
}

If Parquet file type contains complex data types, then use pyspark for both services.
If Delimited file type has a target location in an S3 bucket, then use pyspark for both services.
For Hadoop, ensure that you use same writer for both the services.

Prerequisites

The source and target (NFS) locations have to be mounted onto the docker containers of unload and load service. Please note that the locations on the containers are what needs to be used when creating the connector-info’s using the controller.

Copy

# As an example
unload-service:
     image: delphix-file-connector-unload-service-app:<HYPERSCALE VERSION> 
     ...
     volumes:
          ...
          - /path/to/nfs/mounted/source1/files:/mnt/source1
          - /path/to/nfs/mounted/source2/files:/mnt/source2
...
load-service:
     image: delphix-file-connector-load-service-app:<HYPERSCALE VERSION> 
     ...
     volumes:
          ...
          - /path/to/nfs/mounted/target1/files:/mnt/target1
          - /path/to/nfs/mounted/target2/files:/mnt/target2

Set the required data writer using the DATA_WRITER_TYPE environment variable.

Copy

unload-service:
     image: delphix-file-connector-unload-service-app:<HYPERSCALE VERSION> 
     ...
     volumes:
          ...
          - DATA_WRITER_TYPE=pyspark
...
load-service:
     image: delphix-file-connector-load-service-app:<HYPERSCALE VERSION> 
     ...
     environment:
          ...
          - DATA_WRITER_TYPE=pyspark

The connector should be able to access the AWS S3 buckets (the source and target locations). The following approaches are supported by the connector and can be used to authenticate with the S3 bucket:

Attaching the IAM role to the EC2 instance where the hyperscale masking services will be deployed.
- IAM Roles are designed for applications to securely make AWS-API requests from EC2 instances, without the necessity to manage the security credentials that the applications use.
  - Using the AWS console UI or AWS CLI, attach the IAM role to the EC2 instance running the Hyperscale services. To know more, check the AWS Documentation.
  - With IAM role authentication, there is no need to pass the AWS credentials during the connector-info creation.
    Copy
    # Example connector-info payload { "source": { "type": "AWS", "properties": { "server": "S3", "path": "aws_s3_bucket/sub_folder(s)" } }, "target": { "type": "AWS", "properties": { "server": "S3", "path": "aws_s3_bucket/sub_folder(s)" } } }

Passing the AWS Access Key ID & AWS Secret Access Key attached to an AWS role:

Access keys are long-term credentials generated for an IAM user or role. These keys can be for programmatic requests to the AWS CLI or AWS API (directly or using the AWS SDK). To know more, check the AWS Documentation.

These credentials can be passed during the connector-info creation.

Copy

# Example connector-info payload
{
  "source": {
    "type": "AWS",
    "properties": {
        "server": "S3",
        "path": "aws_s3_bucket/sub_folder(s)",
        "aws_region": "us-west-2",
        "aws_access_key_id": "AWS_ACCESS_KEY_ID",
        "aws_secret_access_key": "AWS_SECRET_ACCESS_KEY"
    }
  },
  "target": {
    "type": "AWS",
    "properties": {
        "server": "S3",
        "path": "aws_s3_bucket/sub_folder(s)",
        "aws_region": "us-west-2",
        "aws_access_key_id": "AWS_ACCESS_KEY_ID",
        "aws_secret_access_key": "AWS_SECRET_ACCESS_KEY"
    }
  }
}

They can also be set as environment variables when bringing up the connector services.

Copy

unload-service:
    ...
    environment:
      - AWS_DEFAULT_REGION=us-east-1
      - AWS_ACCESS_KEY_ID=<aws_access_key_id>
      - AWS_SECRET_ACCESS_KEY=<aws_secret_access_key>

  ...
  load-service:
    ...
    environment:
      - AWS_DEFAULT_REGION=us-east-1
      - AWS_ACCESS_KEY_ID=<aws_access_key_id>
      - AWS_SECRET_ACCESS_KEY=<aws_secret_access_key>

Set the required data writer using the DATA_WRITER_TYPE environment variable.

Copy

unload-service:
     image: delphix-file-connector-unload-service-app:<HYPERSCALE VERSION> 
     ...
     volumes:
          ...
          - DATA_WRITER_TYPE=pyspark
...
load-service:
     image: delphix-file-connector-load-service-app:<HYPERSCALE VERSION> 
     ...
     environment:
          ...
          - DATA_WRITER_TYPE=pyspark

Set SKIP_UNLOAD_WRITERS and SKIP_LOAD_WRITERS environment variables to true to enable staging push.

Copy

unload-service:
     image: delphix-file-connector-unload-service-app:<HYPERSCALE VERSION> 
     ...
     volumes:
          ...
          - SKIP_UNLOAD_WRITERS=true
...
load-service:
     image: delphix-file-connector-load-service-app:<HYPERSCALE VERSION> 
     ...
     environment:
          ...
          - SKIP_LOAD_WRITERS=true

In case the source is HADOOP-DB, the connector requires Kerberos authentication to access the Hadoop cluster. Users must provide a principal name and keytab file that grant access to both the Hive database and the underlying HDFS. In case source is HADOOP-FS, the Kerberos principal name is necessary to retrieve files from the specified HDFS path. In both cases, the target location is always an HDFS file system. Below are the pre-requisites for using Hadoop as a connector type:

The user has to volume mount the core-site.xml, hadoop.keytab and krb5.conf in unload and load services. Below is an example of the same.
If errors occur during the unload or load processes, retry any failed jobs using only the minimal viable configuration for the core-site.xml and hdfs-site.xml mounted on Hyperscale.

Copy

# As an example
unload-service:
     image: delphix-file-connector-unload-service-app:<HYPERSCALE VERSION> 
     ...
     volumes:
          ...
          - /path/to/hadoop/keytab/hdfs.keytab:/app/hadoop.keytab
          - /path/to/hadoop/krb5.conf:/etc/krb5.conf
          - /path/to/hadoop/core-site.xml:/app/hadoop/etc/hadoop/core-site.xml
          - /path/to/hadoop/hdfs-site.xml:/app/hadoop/etc/hadoop/hdfs-site.xml
...
load-service:
     image: delphix-file-connector-load-service-app:<HYPERSCALE VERSION> 
     ...
     volumes:
          ...
          - /path/to/hadoop/keytab/hdfs.keytab:/app/hadoop.keytab
          - /path/to/hadoop/krb5.conf:/etc/krb5.conf
          - /path/to/hadoop/core-site.xml:/app/hadoop/etc/hadoop/core-site.xml
          - /path/to/hadoop/hdfs-site.xml:/app/hadoop/etc/hadoop/hdfs-site.xml

In case the user is using DATA_WRITER_TYPE as pyarrow then they also have to provide Hadoop client location as volume mount in addition to above volume mounts in unload and load services. Below is example of same.
Copy
```
- /path/to/hadoop-client/hadoop:/app/hadoop
```
```
 
```

Below are a few connector info examples with hadoop as source and target:

The source and target are HADOOP-FS

Copy

{
  "connectorName": "Hadoop_Hdfs_connector",
  "source": {
    "type": "HADOOP-FS",
    "properties": {
      "server": "<Hadoop server hostname>",
      "port": "<HDFS port>",
      "path": "Source file location",
      "principal_name": "<Kerberos principal name>",
      "protocol": "hdfs"
    }
  },
  "target": {
    "type": "HADOOP-FS",
    "properties": {
      "server": "<Hadoop server hostname>",
      "path": "<target file location in HDFS>",
      "protocol": "hdfs",
      "port": "<HDFS port>",
      "principal_name": "<Kerberos principal name>"
    }
  }
}

The source is HADOOP-DB and target is HADOOP-FS

Copy

{
  "connectorName": "Hadoop_Hive_connector",
  "source": {
    "type": "HADOOP-DB",
    "properties": {
      "server": "<Hadoop server hostname>",
      "database_name": "<database name where the tables are present>",
      "database_type": "hive",
      "database_port": "<Thrift SSL port for Apache hive>",
      "principal_name": "<Kerberos principal name>",
      "protocol": "hdfs"
    }
  },
  "target": {
    "type": "HADOOP-FS",
    "properties": {
      "server": "<Hadoop server hostname>",
      "path": "<target file location in HDFS>",
      "protocol": "hdfs",
      "port": "<HDFS port>",
      "principal_name": "<Kerberos principal name>"
    }
  }
}

Minimum hdfs-site.xml configurations

Copy

<?xml version="1.0" encoding="UTF-8"?>  
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>  
<configuration>  
  <property>  
    <name>dfs.nameservices</name>  
    <value>sourcecluster</value>  
  </property>  
  <property>  
    <name>dfs.ha.namenodes.sourcecluster</name>  
    <value>nn1,nn2</value>  
  </property>  
  <property>  
    <name>dfs.namenode.rpc-address.sourcecluster.nn1</name>  
    <value>mm-hadoop.dlpxdc.co:8020</value>  
  </property>  
  <property>  
    <name>dfs.namenode.rpc-address.sourcecluster.nn2</name>  
    <value>mm-hadoop-s2.dlpxdc.co:8020</value>  
  </property>  
  <property>  
    <name>dfs.client.failover.proxy.provider.sourcecluster</name> <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>  
  </property>  
</configuration>

Minimum core-site.xml configurations

Copy

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <!-- Kerberos authentication -->
  <property>
    <name>hadoop.security.authentication</name>
    <value>kerberos</value>
  </property>
  <property>
    <name>hadoop.security.authorization</name>
    <value>true</value>
  </property>
</configuration>

Property values

Property	Value
SOURCE_KEY_FIELD_NAMES	unique_source_files_identifier
LOAD_SERVICE_REQUIREPOSTLOAD	false
DATA_WRITER_TYPE	“pyarrow” (Default for delimited-unload-service) “pyspark” “cat” (Default as well as only applicable to delimited-load-service)
UNLOAD_SPARK_DRIVER_MEMORY	90% of available memory
UNLOAD_SPARK_DRIVER_CORES	90% of available cores
MAX_WORKER_THREADS_PER_JOB	512
SKIP_UNLOAD_WRITERS	false
SKIP_LOAD_WRITERS	false

For default values, see Configuration settings.

Supported compression types for Parquet files

SNAPPY
GZIP
BROTLI
ZSTD
LZ4
UNCOMPRESSED

Supported data types

The following are the supported data types for delimited files:

String/Text
Double
Int64
Timestamp

The following are the supported data types for parquet files :

BOOLEAN
INT32
INT64
INT96
FLOAT
DOUBLE
BYTE_ARRAY

In-place masking support

The in-place masking functionality for file connectors is a powerful feature that allows you to mask sensitive data directly at its source location. When you enable in-place masking:

The file connector reads the original data from your specified source location.
The masking process is applied to this original data.
During the load operation, the connector writes the masked data to the target location.

If you specify "/" as the target location in the connector information payload, the masked data will be written directly back to the original source location, overwriting your unmasked data.
Therefore, you must exercise extreme care when setting "/" as the target location. Accidentally providing this value will result in your original source data being permanently replaced with the masked version.

Known limitations

Delimited file types

Supports only Single-character ASCII delimiters
The end-of-record character can only be \n, \r, or \r\n.
Limitations with PyArrow Data Writer:
- Output files will exclusively enclose all string types with double quotes (“).
  Col umns with double data types will be converted to strings. For example, 6377974237282886994505 will be converted to “36377974237282886994505”.
- Columns with int64 data type will be converted to strings. For example, 0009435304391722556805 will be converted to “00009435304391722556805".
Limitation with PySpark data writer:
- PySpark is more memory intensive, so in case we are processing data that is more in size in comparison to the available memory then we may run into issues related to resource exhaustion. Caution: The size of split files multiplied by the number of cores must not exceed the system memory.
- With PyAarrow as the data writer, the split files are generated one after the other, so the masking-service is called as and when a split is created. With PySpark as the data writer, all split files are available only after the split process is complete. So the masking service will be only called after all splits are completed. Due to this, the overall time taken to complete the hyperscale masking execution will be more compared to the former.
- There is a possibility that the number of splits created in the end will be less than the requested number, this generally happens when the file size is small, and spark doesn’t create as many partitions as the requested split number.

Parquet file types

There's a limitation in the File Connector related to Parquet file compatibility. The connector does not currently support the following Parquet data type:

INT64 (TIME(MICROS,false))
INT64 (TIMESTAMP(NANOS,false))

When Parquet files include columns defined with this specific logical type, file ingestion or processing may fail or result in unexpected behaviour.

Generally, the parquet files are compressed and the compression factor could vary from 2x to 70x or even more. So, when working with such larger files the connector will need a host which has large enough memory to accommodate the parallel execution of multiple large parquet files. In case the sum of the uncompressed size of parquet files that are getting executed in parallel exceeds 80% of RAM size then the chances of having an “out of memory” error are high. To avoid OOM, the end user can reduce the MAX_WORKER_THREADS_PER_JOB (i.e. reduce the number of parallel threads), ultimately reducing the memory usage.
Struct and list type values are treated as strings for the Delphix Continuous Compliance Engine, therefore, you can not add individual elements of any struct and list to the masking inventory property of the dataset payload.

S3 connector
The S3 connector only supports path-style URLs and does not support virtual-hosted–style URLs.

File connector

Prerequisites

Minimum hdfs-site.xml configurations

Minimum core-site.xml configurations

Property values

Supported compression types for Parquet files

Supported data types

In-place masking support

Known limitations

Delimited file types

Parquet file types

S3 connector